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Abstract 


We present a computational modeling framework for data-driven sim¬ 
ulations and analysis of infectious disease spread in large populations. For 
the purpose of efficient simulations, we devise a parallel solution algorithm 
targeting multi-socket shared memory architectures. The model integrates 
infectious dynamics as continuous-time Markov chains and available data 
such as animal movements or aging are incorporated as externally defined 
events. 

To bring out parallelism and accelerate the computations, we decom¬ 
pose the spatial domain and optimize cross-boundary communication us¬ 
ing dependency-aware task scheduling. Using registered livestock data at 
a high spatio-temporal resolution, we demonstrate that our approach not 
only is resilient to varying model configurations, but also scales on all 
physical cores at realistic work loads. Finally, we show that these very 
features enable the solution of inverse problems on national scales. 

Keywords: Computational epidemiology. Discrete-event simulation. Mul¬ 
ticore implementation. Stochastic modeling, Task-based computing. 

AMS subject classification: 68W10, 68U20 (Primary)-, 65C20, 65C40 
(Secondary). 

1 Introduction 

Livestock diseases have a major economic impact on farmers, the livestock in¬ 
dustry and countries [1, 2]. Modeling and simulation of infectious disease spread 
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is important in designing cost-efficient surveillance and control [3]. One chal¬ 
lenge is that disease dynamics and transmission routes for various pathogens 
are fundamentally different. Indirect transmission of pathogens via the envi¬ 
ronment for fecal-oral diseases requires a different model compared to diseases 
that spread with direct contact between individuals [4] . Another challenge is to 
incorporate the increasing amount of epidemiologically relevant data into the 
models [5]. It is therefore desirable to have simulation tools that are flexible 
to various disease spread models yet efficient to handle the large amounts of 
available livestock data. 

Due to uncertainties in the exact details in pathogen transmission [6] and the 
inherent random nature of animal interactions, stochastic modeling is natural 
and often required. Spatial models that include proximity to infected farms 
with local clustering of disease spread gained popularity during the Foot-mouth- 
disease epidemic in 2001 [7, 8, 9]. Another important route for disease spread 
is animal trade, creating a temporal network of contacts between farms [10]. It 
has been shown that the topology and connectivity of the network has great 
impact on the disease spread and on the effect of control measures [11, 12]. 

Stochastic models on discrete state-spaces are typically simulated using Dis¬ 
crete Event simulation (DES), a general approach to evolve dynamical systems 
consisting of discrete events including, in particular, continuous-time Markov 
chains (CTMCs) [13]. As most realistic epidemiological models are formulated 
on a large state-space and/or need to be studied over comparably long periods 
of time, parallelization is desirable. The highest degree of parallelism is typi¬ 
cally achieved by a decomposition of the spatial information, often represented 
as a graph or network, into a set of sub-domains [14] . It is then up to the strat¬ 
egy for event handling at domain boundaries how well the concurrent execution 
scales and which overall degree of parallelism is extractable. As it may hinder 
scalability, a constraint that plays a crucial role in the design of parallel DES is 
to maintain the sequential ordering of events, that is, to preserve the underlying 
causality of the model. 

In general, there are two types of boundary events that can occur during a 
simulation, which hence ultimately decide what will be the optimal paralleliza¬ 
tion strategy. Those which are deterministic and essentially of fully predictable 
character, and those which are stochastic and not predictable at an earlier sim¬ 
ulation time [15]. To parallelize events that belong to the latter group, sophisti¬ 
cated approaches such as optimistic parallel DES algorithms have been proposed 
[16]. These approaches may use speculative execution to enable scalability but 
must implement rollback mechanisms in case the event causality is violated [17]. 
Alternatively, in simulations where the domain crossing events are deterministic 
and thus predictable, conservative simulation may be used as it is possible to 
avoid causal violations altogether [14, 18]. In particular, a parallel scheduler 
[19] can be used to create an execution order which guarantees causality, as 
has been previously shown in [20, 21], notably with the focus on simulation of 
telecommunication networks. 

In this paper we present an efficient and flexible framework for data-driven 
modeling of disease spread simulations. The model integrates disease dynam- 
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ics as continuous-time Markov chains and real livestock data as deterministic 
events. This allows us to create a temporal network of disease transmission, 
which has been shown to be a key aspect in modeling and simulation of spatial 
disease spread [11, 12]. Previously, agent-based simulations based on synthetic 
data have been studied by others [22, 23]. 

The way the model is defined allows us to predict future boundary events at 
any simulation time, and hence we are able to create parallel execution traces 
which respect causality. In particular, we find that dependency-aware task 
computing [24, 25] can be used to implement this approach with high efficiency, 
as all the necessary information to maintain spatial and temporal causality of 
events can be specified via dynamic creation of tasks and dependencies. 

This is in contrast to previous approaches [20, 21], as the scheduler is not 
an implicit part of the parallel simulation algorithm, but can be chosen by 
the user from a wide selection of openly available libraries (e.g. Open MP 4.0, 
DmpsS [26], or StarPU [27]). We show how the selected library is integrated 
into our simulation framework, by assigning parts of the sequential algorithm to 
independent tasks that are scheduled using a certain set of rules. We evaluate 
this approach using the task-parallel run-time library SuperGlue [28], which 
has been demonstrated to be an efficient scheduler of fine-grained tasks. Using 
our simulator on models with realistic work-loads, we demonstrate scalability 
on a multi-socket shared-memory system and investigate when this approach 
is preferable in comparison to traditional parallelization techniques. As the 
achievable scalability clearly depends on the properties of the individual model, 
we in particular choose to investigate the influence of the model’s connectivity 
pattern. 

The paper is organized as follows. In §2 we introduce the mathematical foun¬ 
dation for our framework. In §3 we discuss the sequential simulation algorithm 
and the strategy for parallelization. In §4 we present numerical experiments car¬ 
ried out on benchmarks consisting of a recently proposed epidemiological model 
incorporating large amounts of registered data. We also include an example of 
an inverse problem for an epidemic model on national scales. Finally, in §5 we 
offer a concluding discussion around the central themes of the paper. 

2 Epidemiological modeling 

We consider in this section a highly general approach to epidemiological mod¬ 
eling. Proceeding stepwise we start with a description of single-node stochastic 
SIR-type models in the form of continuous-time Markov chains, using a compact 
notation that also encompasses externally defined events. We next couple an en¬ 
semble of such single-node models into a network with prescribed transitions in 
between the nodes to arrive at a global description. Finally, since most realistic 
models on multiple scales will typically incorporate also quantities for which a 
continuous description is more natural, we consider a mixed approach in which 
continuous-time Markov chains are coupled to ordinary differential equations 
(ODEs). 
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2.1 Discrete states 


We shall use a compact notation for jump stochastic differential equations (jump 
SDEs) as follows. We assume a probability space (il, P) where the filtration 
Tt>o contains Poisson processes of any finite dimensionality. The time depen¬ 
dent state vector Xt = X(t] uj) € with u; S 12, counts at time t the number 
of individuals of each of Nc different categories, or compartments. Since the 
random process is of discrete character, the map t ^ X{t) is right continuous 
only; by X{t—) we therefore denote the value of the state before any events 
scheduled at time t. 

Given a rate function r : > R_). and a stoichiometric coefficient s € 

Z^“, we write a continuous-time Markov chain in the form 


dXt = sfi{dt), ( 2 . 1 ) 

with scalar counting measure p,{dt) = iJ,(r{X{t—)); dt). This notation expresses 
a dynamics consisting of events with exponentially distributed waiting times 
of intensities r(X(t—)); specifically E[p,{dt)] = E[r{X{t—)) dt]. An event at 
time t implies that the state is to be changed according to the prescription 
X{t) = X{t—) + s. In (2.1), note that if some stoichiometric coefficient Si < 0, 
then we must have that r{x) = 0 for Xi small enough, or otherwise the chain 
will reach negative states with positive probability. 

The generalization of (2.1) to non-scalar counting measures is straightfor¬ 
ward. Assuming Nt different transitions specified by a vector intensity R : 
Z(^° —>■ R(^‘ and a stoichiometric matrix S € we simply write 

dXt=S(i{dt), (2.2) 

with pi{dt) = [fii{dt), ..., p,N^{dt)]'^ and, for each k, fik{dt) = p,{Rk{X{t—))-, dt). 
As a concrete example, consider the classical SIR-model [29] 


S + I A2/ 1 

I ^R j 


(2.3) 


With state vector X = [5', I, R] this can be understood as 


S = 


-1 0 

1 -1 

0 1 


R{x) = [f3xiX2,^X3]'^. 


(2.4) 

(2.5) 


With one small additional convention the above notation also encompasses 
events that have been defined externally. Suppose, for example, in the SIR- 
model, that susceptible individuals are to be added one by one at known deter¬ 
ministic times (ti). To accomplish this we replace (2.4) with 


S = 


-1 0 1 

1 -1 0 

0 10 


( 2 . 6 ) 
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and additionally define in terms of the Dirac measure, 

fJ-sidt) = ^ SiU; dt). (2.7) 

i 

Eq. (2.2) now evolves the full dynamics of the coupled stochastic-deterministic 
model. Note that when removing individuals using this scheme, some care is 
required to be able to guarantee a non-negative chain. 

2.2 Network model 

Although the previous discussion is of completely general character it makes 
sense to handle the collective dynamics of a possibly very large collection of 
nodes in a slightly more streamlined fashion. Assuming iV„ nodes in total we 
consider the state matrix X S 2 iN^xN„ evolve the local dynamics by a 
version of (2.2), 

dx["'> ( 2 . 8 ) 

Given an undirected graph Q each node i is modeled to affect the state of 
the nodes in the connected components C{i) of f, and in turn, to be affected 
by all nodes j such that i € C'(j). The interconnecting dynamics can then be 
written as 

^ Cu^^’^\dt). (2.9) 

Note that in (2.9), global consistency is enforced as follows. The fcth “outgoing” 
event is a change of state according to X(*^(t) = X*^*^(t—) — C^, and, for some 
j G C{i), X('^)(t) = By inspection the intensity for this transition 

is E[nl^'^\dt)] = dt], say, where the dependency is only on 

the state of the “sending” node i. 

Using superposition of (2.8) and (2.9) the overall dynamics becomes 

dxf^ =S/x«(df) - ^ (2.10) 

jeC(i) j;ieCU) 

As before we conveniently allow externally defined deterministic events to be 
included in this description using the equivalent construction in terms of Dirac 
measures. 

2.3 Continuous states 

In the previous description we assumed essentially that individuals were counted, 
such that a discrete stochastic model was needed to accurately capture the 
dynamics of a possibly small and noisy population. In a multiscale model, 
however, it makes sense to allow also continuous state variables, representing, for 
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example, environmental properties more naturally described in a macroscopic 
language. 

Assuming an additional continuous state matrix Y G to be available 

we find that a general model corresponding to (2.10) is 

( 2 . 11 ) 

- ^ g(xW(i-),r«(t))+ ^ g(x(^)(t-),yW(t)). 

jeC(i) j;*eC(i) 

Importantly, with this addition (2.10) can now depend on the continuous state 
variable. 


dt], (2.12) 

^l^k’'hdt)] = £;[iVf^'^(XW(t-),y«(t))dt], (2.13) 

where of course k is in the range where the dynamics is stochastic rather than 
defined externally from a database. 

Eqs. (2.10), (2.11), and (2.12)-(2.13) form the basis for our epidemiological 
computational framework, next to be described. 


3 Implementation 

In the following section we discuss the implementation details of our compu¬ 
tational framework. We begin with indicating how numerical methods can be 
consistently designed to approximate the mathematical model arrived at previ¬ 
ously. A description of the sequential solution algorithm and a presentation of 
the chosen parallelization strategy based on domain decomposition then follows. 
We propose to process events that cross domain boundaries as tasks and thus 
conclude with the introduction of dependency aware task computing and an 
associated scheduling scheme. 

3.1 Numerical Methods 

In order to be able to effectively incorporate finitely resolved temporal data as 
well as to obtain a parallelizable framework, we discretize time as 0 = tg < ti < 
t 2 < • • •. We thus write the epidemiological model in (2.10)-(2.11) in integral 
form, using a global notation which incorporates the whole network. 


X„_|_i — X„ -|- J GA(c?s), 

(3.1) 

Ptn + l 


r„+i = y„ + y F{x{s),Y{s))ds, 

(3.2) 


with the understanding that (X, F)„ = (X, F)(<„). 
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Typical numerical approaches to (3.1)-(3.2) are constructed via operator 
splitting and finite differences [30] . As a representable example we take 

Xjj+i = X„ + f GA(X(s—),yn; ds), (3-3) 


Yn+i = 


F ([X„ + X„_|_i]/2, Y (s)) ds. 


(3.4) 


In (3.3) we freeze the variable T at a previous time-step and integrate the 
stochastic dynamics only. Next, in (3.4) we insert an average effective value of 
X and integrate the deterministic part using any suitable deterministic numerical 
method. 

To describe a more concrete numerical method, some assumptions are in 
order. Firstly, in (2.10) we assume that events connecting two nodes have been 
externally defined. In particular, this assumption is satisfied for the important 
case of domesticated herds of animals who move between nodes due to human 
interventions only. Secondly, in (2.11) we put g = 0 and thus remove all direct 
influence between continuous variables in connected nodes. This is reasonable 
for macroscopic variables that are not easily transported, like bacterias in soil, 
but could of course be violated for other media like groundwater or air. 

For this scenario we can write down a concrete numerical method per node 
i as follows. 


4 - 


1 = XW + / " S/x«(X«(s-),tW; ds), 

‘^'sM«(XW(s-),rW; ds) 

- E ds) 

TE=T«+/(xE,FW)At„. 


(3.5) 

(3.6) 


(3.7) 


In (3.5) the stochastic part (subscript s) of the measure is evolved in time to 
produce the temporary variable X. Next, (3.6) incorporates all externally de¬ 
fined deterministic events (subscript d ), both locally on the node, and according 
to the connectivity of the network. Finally, (3.6) is just the usual Euler forward 
method in time with time-step At„ = — tn evolving the continuous state 

Y. The particular splitting method (3.5)-(3.7) forms the basis for much of the 
results reported in §4. 


3.2 External events 

Similar to the epidemiological events (2.4), the external events modify the dis¬ 
crete state according to a transition vector (2.6), but at a pre-defined time t. 
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We divide external events into two types; events of type Ei operate on the 
state of a single node, while events of type operate on the states of two 
nodes. It is meaningful to distinguish between these types of events as they are 
processed differently by the parallel algorithm discussed later. They are defined 
by a set of attributes 

El = {R,t,n,i}, (3-8) 

E 2 = (3.9) 

where t is the time of the event, K the transition vector, n the number of 
individuals affected and i and j the indices of the affected nodes. This is a 
minimal set of attributes which can be further extended for specific models. As 
an example, within the context of the SIR model (2.3) we can define a birth 
event {K, t, n, i} of type Ei with the transition vector K = [1, 0,0]^. 

In the actual implementation, the transition vector is a column of the sto¬ 
ichiometric matrix (2.6) that is indexed by the event. When the event is pro¬ 
cessed at time t, it changes the state of node i according to 

= Xf ^-MRn. (3.10) 

The overall spatial domain of the model can be understood as a graph G = 
{V,E). The edges E result from events of type E2 acting on source and desti¬ 
nation nodes X^*^ and 

3.3 Sequential simulation algorithm 

The sequential simulation algorithm is divided into three parts; the process¬ 
ing of stochastic events (3.5), hereafter referred to as the stochastic step, the 
processing of external events (3.6), or deterministic step, and the update of the 
continuous state variable (3.7). These steps are processed repeatedly in the 
above-mentioned order until the simulation reaches the end. 

The stochastic step (Algorithm 1, p. 29) is an adaptation of Gillespie’s Direct 
Method [31]. The algorithm generates a trajectory from a continuous-time 
Markov chain. At first, the rates LOn^ for all stochastic events n = 1... Nt 
are evaluated in all nodes Xj,i = 1...7V„. Then, in each node we sum up 
transition rates into Next, the algorithm uses inverse transform 

sampling to obtain an exponentially distributed random variable representing 
the next stochastic event time for each node X(*\ 

rW =-log(rand)/A(*). (3.11) 

Here, rand denotes a uniformly distributed random number in the range (0,1). 
To obtain the index of the stochastic event that occurred within the node X^*) 
we generate a new random number rand and find n such that 

n—1 n 

^a;j(XW) < A(*) rand < ^Wj(X«). 
i=i 
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(3.12) 


When n is found, we compute the state update according to the transition 
matrix (2.4), setting + S„ and simulation time t = t + Finally, 

(0 

to obtain the new next event time, the rate ojn of the event just occurred and 
all its dependent events need to be re-computed as in (3.11). For fast execution, 
these dependencies are stored in a dependency graph that is traversed at this 
stage. The algorithm repeats until a defined stopping time is reached, where 
the external events will next be processed. 

The deterministic step works as a read and incorporate algorithm. It moves 
through the list of external events and processes them at the defined event time. 
In particular, if the event specifies a single compartment where the transition 
occurs, it can be directly applied to X^®). 

Finally, the continuous state variable is updated. As discussed in §3.1, in this 
step different numerical methods can be applied. Note that the thus updated 
continuous state generally affects the rate of stochastic events Thus, before 
the simulation proceeds with the next iteration of the stochastic step, the event 
times need to be be rescaled [32] using 



The implementation of the algorithm is written in C. The overall design is in¬ 
spired and partly adapted from the Unstructured Mesh Reaction-Diffusion Mas¬ 
ter Equation (URDME) framework [33, 34]. 

3.4 Parallel simulation algorithm 

The parallelization starts with a decomposition of the spatial domain of the 
model understood as a graph G = (U, E). The target of this graph partitioning 
problem is to divide the set of vertices V of size Nn into k approximately equally 
sized sub-domains Ui, U 2 , 14- The cutting of edges E follows straightforwardly 

from the consecutive assignment of vertices to sub-domains. This partitioning 
strategy does not guarantee a minimum amount of edge cuts, but as the distri¬ 
bution of edges is predominantly homogeneous in our data, we believe that the 
partitioning will not benefit from more sophisticated approaches. Nonetheless, 
if edges are distributed heterogeneously, a Minimum Bisection algorithm [35] 
may generate an optimized cut that contributes to a better performance of the 
parallel solver. 

After partitions Vi,,k are defined, the preprocessing algorithm continues to 
rearrange the external events into a structure that is more convenient for parallel 
processing. Firstly, the external events of type Ex are assigned to k lists ff, 
such that all Ex events affecting the nodes X^*^ € 14 are stored in the A:th list. 

Second, external events of type E 2 are divided in two categories; external 
events of type E 2 where the source and destination nodes lie within the same 
sub-domain I 4 are assigned to lists Events in lists and can be pro¬ 
cessed by a thread assigned to the fcth sub-domain in private. Events of type 
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E 2 where the source node and destination node do not lie within the same sub- 
domain Vfc, are assigned to a second list This list then contains domain 
crossing external events that have to be handled by the simulator in a special 
way. 

The complexity of the data-rearrangement is 0(n), where n is the number 
of external events. Although n can be large, the workload is typically negligible 
for real-scale models. For example, in the national scale model presented in 
§4.1, the operation takes ~ 0.1% of the total simulation time on one core and 
~ 1% of the simulation time on 32 cores, respectively. Moreover, re-arranged 
lists can be stored and re-used for models that are simulated using the same 
decomposition, as is done in the simulation study in §4.3. 

Finally, the decomposed problem can be simulated in parallel. For simplic¬ 
ity, let us assume that each sub-domain k is bound to one computing thread. 
Then every thread processes the stochastic step (3.5) and the update of the 
continuous variable (3.7) on private nodes of 14, as well as the deterministic 
step (3.6) on external event lists and £2 (Algorithm 2, p. 30). Since time 
has been discretized, these computations are embarrassingly parallel, in that no 
communication between neighboring threads is necessary during the processing 
of the nth time window + At„. The potential bottleneck of the simulation 
lies in the simulation of the cross-boundary events in £|. 

In our study we handle events in in different ways. The hrst possibility 
is to compute them entirely in serial. This is a valid approach if there are very 
few events in £3 in relation to the private events as the scaling of the private 
computations will not be affected. On the other hand, if the overall simulation 
is dominated by the processing of £| events, it can be regarded as serialized, as 
little concurrency will be extractable using such an approach. Hence we focus 
on an intermediate ratio of private and global work, where events in £3 occur 
at every deterministic time step At, but at a lower frequency than the other 
private events. We will also investigate if scaling in this regime is achievable if 
£3 events are scheduled using dependency-aware task-computing. 

3.5 Task-based computing 

An increasing amount of scientihc computations are parallelized using task- 
based computing [36, 37, 38]. In order to apply this pattern the programmer 
typically has to divide a larger chunk of work into a group of smaller tasks which 
can be processed asynchronously. A run-time library [24, 25] is then used to 
create an execution schedule of the tasks on the available parallel hardware. 

If the granularity of tasks is sufficiently fine, the schedule will be denser 
and the idle time shorter. On the other hand, the scheduler synchronizes a 
larger number of small tasks which usually implies more overhead. See [39] for 
a thorough discussion on the impact of granularity. 

If the scheduler supports dependency-awareness [40], the programmer can 
further define a number of task dependencies. This is a critical feature if data 
is shared between tasks and therefore a processing order has to be enforced. 
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The scheduler then manages the dependencies in the form of a Directed Acyclic 
Graph (DAG) and spawns tasks whenever all dependencies are met. 

We believe that the usage of task-based computing is beneficial in our compu¬ 
tational framework, as a small granularity of processes is given by the underlying 
modeling. In our approach we aim to divide our computations into tasks and 
define a scheduling policy which guarantees causality of events although they 
are processed in parallel. 

These scheduling rules can be implemented on any dependency aware task 
scheduler, the only requirement for some of the scheduling policies is the support 
for dynamic addressing of a sub-set of dependencies, e.g. via an array of pointers. 
For example. Open MP 4.0 does not support this [41]. In our computational 
experiments, we make use of the run-time library SuperGlue [28]. In SuperGlue, 
dependencies are assigned to data and expressed via data versioning [42]. If a 
chunk of data is being processed by a task, a version counter representing the 
data access will be increased. Other tasks that are dependent on the chunk will 
be spawned whenever the new version becomes available. SuperGlue has been 
demonstrated to be an efficient shared-memory task-scheduler that it is capable 
of operating at a comparably low synchronization overhead. The processing of 
dependencies and spawning of tasks is dynamic, and SuperGlue additionally 
supports load balancing by work stealing from over-utilized threads. 

3.6 Scheduling and dependencies 

We now define the tasks and their dependencies that are used in the task- 
based implementation of the parallel algorithm. Task 7s(fc,n) executes private 
computations on the decomposed data of the kt\i sub-domain (lines 5 and 6 of 
Pseudo-code 2). That is the stochastic step (3.5) on all nodes X„ € 14, the 
processing of the private external events in lists £i and £2 (3.6), as well as the 
update of the continuous variables (3.7). The counter n indicates the iteration 
of the time window -|- At„. 

Task 7m processes state updates due to the domain crossing events stored in 
list £ 2 - In order to estimate the possible impact of granularity of Tm tasks, we 
compare two different scheduling policies; if the task is constructed for coarse¬ 
grained processing, we compute all £2 events occurring at the nth time window 
At in one single task. Thus, the task takes only one argument, Tmin). 

If tasks are constructed for fine-grained processing, we schedule each event 
in £2 as a distinct task. We then denote the task by TM{ki,k 2 ,i), where fci 
and ^2 are the sub-domains subject to an A 2 -event update, in which two nodes 
e Vi and X^"*) e V 2 are affected. The counter i now denotes the total 
order of the B 2 events in £2 as given by the model input. This implies that if 
1.. .n events exist in the window [t„,t„+i], they have to be processed by the 
task in this order. 

Both tasks 7s and 7m are scheduled repeatedly until the simulation reaches 
its end time. Precedence dependencies between tasks are expressed using the 
‘^’-operator. For example, Ts(ki,n) Ts{ki,m) means that task Ts{ki,n) 
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must complete its execution before task Ts{ki,m) is spawned. Our task-based 
implementation contains the following dependencies: 

1. 7s(fc, n) ^ Ts(k, to) if n < to, to maintain the causality of private updates 
of sub-domain 14. 

2. To maintain the causality of domain crossing events: 

(a) 7m (n) ^ 7m (to) if n < to, at coarse-grained processing, 

(b) TMika,kb,n) -< TM{kc,kd,m) if n < to and kc G {fco,fc&} or kd G 
{ka, kb}, at fine-grained processing. 

3. To maintain the causality between private sub-domain updates and domain¬ 
crossing events: 

(a) {Ts{ki,m),Ts{k2,m),... ,Ts{kn,m)} 7m( m) for all sub-domains 
Vi,... 14 that will be affected by an E 2 events processed in task 
Tmi'm), at coarse-grained processing, 

(b) {Ts{ka,n),Ts{kb,n)} -< TM{kc,kb,i) ifi G and G {ka,kb} 

or kd G {ka,kb}, at fine-grained processing. 

The presented processing policies lead to a different utilization of the task 
scheduler. Firstly, the task 7 m will be of different size, which leads to a different 
synchronization behavior. Second, rules 3(a) and 3(&) imply that a different 
amount of dependencies will be created for each single task 7 m- In the fine¬ 
grained case, a task 7 m is spawned when two dependencies are met. In the 
coarse-grained case, the number of dependencies per task is set dynamically at 
runtime and can potentially be larger. This can clearly have an impact on the 
bookkeeping overhead. 

4 Computational experiments 

In the following section we present results of computational experiments of our 
simulator. The following measurements were obtained on Sandy, a Dell Power 
Edge R820 computer system equipped with four Intel Xeon E5-4650 processors 
and 8 cores on each socket. We restricted the execution to available physical 
cores, as timing results on hyper-threads were strongly fluctuating. We begin 
with a real-world simulation using animal movement data on national scales, 
followed by a synthetic benchmark for scalability at varying connectivity load, 
and we conclude with a compute-intensive parameter estimation example. 

4.1 National scale simulation of VTEC bacteria spread 

Verotoxigenic Escherichia coli 0157:117 (VTEC 0157) is a zoonotic bacterial 
pathogen with the potential to cause severe disease in humans, notably children 
[43, 44, 45]. Cattle infected with VTEC 0157 are an important reservoir for 
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Figure 3.1: Scheduling trace of the task-based approach; Tasks Tm (red color) 
are processing aggregated or single E 2 events while tasks Ts (other colors) com¬ 
pute private work on partitioned sub-domains. As coarse-grained tasks control 
a higher number of dependencies, blocking may occur. Fine-grained scheduling 
leads to better interleaving but higher overhead cost. 


the bacteria and they shed the bacteria in the feces without any signs of clinical 
disease [46]. Reducing the prevalence of infected cattle in the population could 
potentially reduce the number of human cases. However, the epidemiology 
of VTEC 0157 in cattle is complex and targeted interventions to control the 
bacteria require a thorough understanding of the source and transmission routes 

[46]. 

To explore the feasibility of national scale simulations to improve the under¬ 
standing of the underlying disease spread mechanisms, we have created a model 
of the VTEC 0157 dynamics, using the presented framework. European Union 
legislation requires member states to keep register of bovine animals includ¬ 
ing the location and the date of birth, movements between holdings, and date 
of death or slaughter [47, 48] . These records enable data-driven disease spread 
simulations that include spatio-temporal dynamics of the cattle population with 
regard to age structures, births, herd size, slaughter, and trade patterns. 

The present computational experiment is based on all cattle reports to the 
Swedish Board of Agriculture over the period 2005-07-01 to 2013-12-31. From 
these reports, three types of Ei external events {enter, exit, aging), and a single 
type of E 2 event {animal movements) were condensed. In total there were 
^ 10® external events processed during the total runtime of T = 3106 days. We 
let each integer in 0,1,... ,T represent a synchronization window for external 
events, where in each window 3707 ± 670 Ei events and 235 ± 104 E 2 events 
were processed. A subset of the spatial network consisting of = 37221 nodes 
is visualized in Figure 4.1. 

Most infected cattle shed the bacteria less than 30 days before returning 
to the susceptible state, but calves shed for a longer period than adult cattle 
[49, 50]. To capture this we let the intensity of the transitions between the 
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Figure 4.1: 

Visualization of 
cattle movements 
in the VTEC 0157 
disease spread sim¬ 
ulation (§4.1). The 
arcs shown are a 
random subset of 
the complete dataset 
of ~ 10® recorded 

events. The source 
of the data is the na¬ 
tional cattle register 
at the Swedish Board 
of Agriculture. 
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states depend on the jth age category, 





(4.1) 


The rate for a susceptible individual on the ith node to become infected per 
unit of time is given by 

r]j = (4.2) 

for j = 1,... ,Nn and j S {calves, young stock, adults}. In turn, the expected 
time an infected individual is in an infected state before it returns to the sus¬ 
ceptible state is 


u 



(4.3) 


where 6 = [28, 25, 22] and v = [8, 7,1] x 10“^ are age-dependent constants. The 
factor u can be understood as a time-scale and is difficult to estimate accurately; 
in our experiments it is in fact varied such that u = 1 closely resembles the 
parameterization of the model found in [51]. 

Finally, the continuous variable ipi represents the environmental bacterial 
concentration that asserts an infectious pressure on each individual at the ith 
node. A suitable model is given by 


difi 

dt 


Si,j{t) + ii,j{t) 


I3{t)(p^{t). 


(4.4) 


Again, i = 1,..., are the nodes and Sij and lij refers to the number 
of susceptible and infected individuals in the jth age compartment at the ith 
node, respectively. The constant a is the average shedding rate of bacteria to 
the environment per infected individual, while /3 captures the decay and removal 
of bacteria. In our experiments we used the constant value a = 1 while (3{t) 
varied according to the season. 


m 


log(2)/14: 0 < (t mod 365) < 91 

log(2)/26 : 91 < (t mod 365) < 182 

log(2)/20 : 182 < {t mod 365) < 273 

log(2)/12 : 273 < {t mod 365) < 364 


(4.5) 


We first parallelized the simulation by spreading tasks 7s over multiple cores 
using Open MP and serially processing the intermediate events, hereafter re¬ 
ferred to as the fork-join approach. Next, we simulated the model using the task- 
based approach, scheduling tasks with coarse-grained and fine-grained policies 
as described in §3.6. We chose the number of sub-domains fc to be a multiple of 
the number of threads c. Note that this is also the number of tasks Ts scheduled 
for each time window At. As a higher factor u creates a higher load for the tasks 
7s, we vary u to inspect boundary regions of the parallel performance. 
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u=1, coarse-grained scheduling 


u=10, coarse-grained scheduiing 




u= 1 , tine-grained scheduiing 


1=10, fine-grained scheduiing 




Figure 4.2: Performance measurements of the VTEC 0157 model simulation 
on Sandy at varying scheduling approaches, task sizes, and scale factor u. The 
number of tasks k is chosen to be proportional to the number of threads c. In 
the Open MP parallelization (“forkjoin”), cross-boundary events are processed 
entirely in serial. Error bars represent the standard error in mean (n=10). 
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Figure 4.3: Parallel efficiency of the VTEC 0157 model simulation on Sandy 
at varying factor u. For all task-based approaches the task size k = 16c. Error 
bars represent the standard error in mean (n=10). 
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The scaling of the different approaches is shown in Figure 4.2. For the case 
of M = 1, we find that task sizes are too small to be efficient in both task-based 
approaches, and thus the fork-join approach reaches a higher efficiency. 

At u = 10, we observe that coarse-grained processing performs better than 
the fork-join parallelization, optimally at task sizes k = 16c and k = 32c. In the 
case of fine-grained task processing, we found that the choice of k has a strong 
impact on the performance scaling. While all task densities scale strongly at 
a lower thread count, only the k = 16c density reaches a high efficiency of 
0.58 at full thread consumption. Thus the efficiency is more than doubled in 
comparison to the efficiency of the fork-join parallelization, which was found to 
be 0.23. 

The dependency of the parallel efficiency on the factor u is further detailed 
in Figure 4.3. We observe that scheduling overhead and small task sizes prohibit 
a high efficiency of both task-based approaches if u < 1 while the full potential 
of the approaches is extractable at u > 1 and a larger thread count. Note that 
the thread affinity of tasks was varied throughout the performed experiments 
in order to investigate the impact of data locality. 

We further present a set of characteristics of the coarse-grained and fine¬ 
grained simulations in Table 4.1 and 4.2. As shown in Table 4.1, the granularity 
of the fine-grained task 7 m is ~ 1/30 of the granularity of the coarse-grained 
task 7m, however the task needs to be scheduled about 235 times more often 
throughout the simulation run. 

On the other hand, the advantage of the fine-grained scheduling is empha¬ 
sized by the measurements shown in Table 4.2; the average waiting time to fulfill 
the dependencies for the fine-grained 7m is 44— 108 x lower than for the coarse¬ 
grained 7 m. This is explained by the larger number of dependencies associated 
to the coarse-grained 7m which is growing with the partitioning k. 

The resulting execution trace is also visualized in Figure 3.1, where we show 
that fine-grained Tm tasks interleave more densely with tasks 7s, thus leading 
to lower idle times and higher parallel efficiency. The percentage of total work 
spent on the processing of tasks, the synchronization of worker threads, as well 
as the time spent waiting for fulfilled dependencies are shown in Figure 4.4 for 
the M = 10 configuration. 

For further details of the scheduling performance of the SuperGlue library 
in regards to the task sizes and the number of dependencies, we like to refer the 
reader to the benchmarks available in [28] . 



Task granularities (10^ 

cycles) 

Number of tasks 

k 

Ts 

Tm coarse 

7m fine 

Ts 

Tm coarse 

Tm fine 

256 

596 ± 799 

314 ± 132 

11 ± 5.2 

794112 

3102 

731889 

512 

301 ± 461 

328 ± 145 

11 ± 5.2 

1588224 

3102 

731889 

1024 

152 ± 294 

327 ± 145 

11 ± 5.2 

3176448 

3102 

731889 


Table 4.1: Average task granularity ± the standard deviation, and the total 
number of tasks created during the simulation at a given partitioning k. 
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Waiting for dependencies (10^ cycles) Number of dependencies 


k 

Ts coarse 

7m coarse 

Ts fine 

7m fine 

Ts 

7m coarse 

Tm fine 

256 

15 ± 10 

550 ± 200 

12.5 ± 5 

12.5 ± 5 

1 

146 ± 33 

2 

512 

12.5 ± 5 

850 ± 200 

12.5 ± 5 

12.5 ± 5 

1 

202 ± 58 

2 

1024 

12.5 ± 5 

1350 ± 500 

11 ± 4 

11.5 ± 4 

1 

248 ± 83 

2 


Table 4.2: Maximum ± full-width-half-maximum of the (right-skewed) his¬ 
togram of waiting times for fulfilled task dependencies, and the average amount 
± standard deviation of dependencies assigned to tasks at each discrete time 
interval at a given partitioning k on 32 computing cores. 


, 100 - 

: 80 - 
i 

i 60 - 

I 40 - 

I) 

I -- 
' 0 — 


7m 

Ts 

idle 

sync 


Coarse, k=8c Coarse, k=16c Coarse, k=32c Fine, k=8c 


Figure 4.4: Percentage of total work spent on processing of tasks, synchro¬ 
nization (“sync”), and time spent waiting for dependencies (“idle”) for various 
scheduling policies when measured on 32 cores. Note that the relation of work 
to overhead agrees well with Figure 4.2. 


4.2 Synthetic benchmark 

The results in §4.1 indicate a delicate performance dependency on the balance 
between the local events and the effective connectivity of the network. To 
further investigate this a synthetic benchmark with a fixed load of local events 
was created. This model consists of two compartments S and I only, both 
residing on iV„ = 1000 nodes. The transitions are simply 


S hi, 
I h S, 


(4.6) 


where the initial population size of each compartment li and Si was set to 
1000. This model is considered at times t = [0, At, 2At,.. .T], with At = 1 and 
T = 1000, thus generating about 2000 local events per synchronization time 
window At and node i. 

The nodes were arranged into k sub-domains and a total of pk{k — l)/2 dis¬ 
tinct E 2 events were generated at the end of each time window, each connecting 
two randomly sampled nodes and X^-^^ belonging to different sub-domains. 
Hence p = 1 means that all sub-domains have to communicate with all other 
sub-domains at each synchronization point. The number of tasks for the coarse¬ 
grained and fine-grained approach was set to the number of threads {k = c). 
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32 threads 



-0- forkjoin 
sg_coarse 
-V- sg_fine 


Figure 4.5: Parallel efficiency for the different methods on the synthetic bench¬ 
mark. Error bars represent the standard error in mean (n=10). 


The measurements obtained on the Sandy computer system at full thread 
consumption are presented in Figure 4.5. The parallel efficiency of all methods 
lies at ^ 0.7 for p < 0.1 and remains there even when p —>■ 0 and so we 
deduce that the problem is memory bound. The coarse-grained task-based 
implementation and the fork-join approach scale very similarly with increasing 
connectivity p. The fine-grained task-based approach attains the highest parallel 
efficiency at p < 0.1, but the performance drops at a higher global connectivity. 
This phenomenon arises because each 7M-task creates dependencies on two 
subsequent Tg-tasks at every synchronization window, thus creating a higher 
overhead and limiting asynchronous task execution. 

4.3 Feasibility of parameter estimation 

A usually very compute intensive load case is the fitting of model parameters, 
typically using numerical optimization of some kind. The problem can briefly 
and ideally be described as follows; unknown is the set of parameters k* and 
an observed time-series of data X(t; k*). The parameters k* are estimated by 
repeatedly simulating a whole family of trajectories with parameters fc, where 
k is modified until input data and simulations match up in some suitable sense. 
The framework’s feasibility of fitting an epidemiological model is of course very 
important, since the modeling process at some point or the other will involve 
calibration of parameters with respect to reference data. 

To demonstrate the feasibility of parameter estimation in the current con¬ 
text, we use the epidemiological model introduced in §4.1 and first identify the 
set of parameters [fci, ^ 2 , fca] that have a high degree of observability. A suitable 
such set is 


kj = Vj5j, j G {calves, young stock, adults}. (4-7) 

We let a reference solution be given by a single trajectory X(t; k*), with 5* = 
[28, 25, 22], V = [8.8,3.2,1] x 10“^, and with a and /3 = /3(t) as in §4.1. 
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To obtain a robust procedure, some kind of smoothing statistics should be 
considered. We chose to aggregate counts of animals in neighboring nodes into 
larger regions. To be precise, the overall domain was divided into 21 areas 
(coinciding with the Swedish county codes), after which the goodness of fit 
G{k) was defined by 


Gikr 



(4.8) 


with 


k) = k), 

and where the individual sample trajectories are given by 


(4.9) 


x){t; fc) = y] k, uji), 
lecu) 


(4.10) 


where N is the number of trajectories and G{j),j € {1,...,21} is the set of 
nodes that belong to county j. To quantify the uncertainty in G{k) we 
compute the variance as 


VG{kf 



(4.11) 


where 


XI k)f, ( 4 . 12 ) 

i 

and use -ii2VG{k)/y/N as a measure of the uncertainty. 

Next, the parameter estimation problem is approached by solving the mini¬ 
mization problem 

fc = argminG'(A:)^. (4-13) 

k 

In practice we make use of the Pattern Search routine in [52], which concep¬ 
tually resembles the Golden Section search [53] in its narrowing of the search- 
space. The numerical optimization routine evaluates (4.8) until the residual 
error reaches a defined threshold. In our tests we varied the initial guess of 
the parameters fco but found that the results did not vary substantially. In the 
results below, we conveniently put ko = 1.6k*. 

Since an increasing number of trajectories yields better estimates of the mean 
and variance, we simulate using different number of trajectories. We measure 
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the total solver time on 12 and 32 computing cores, respectively, and we let the 
total number of iterations to be TV = 20 in all cases. The results are presented 
in Table 4.3 where the relative residual is defined as 


R{k) 


\G{k)-G{k*)\ 

\G{k*)\ 


(4.14) 


The optimization landscape of the goal function (4.8), and hence the definiteness 
of the setup itself, is visualized in Figure 4.6. Due to the simple bisection search 
behavior of the numerical routine, the obtained parameters k are in fact the 
same for all displayed cases, although the relative residuals differ considerably. 


Trajectories 

Rel. residual 

Time (c=12) 

Time (c=32) 

10 

0.1738 

46.6 min 

30.2 min 

20 

0.0900 

94.2 min 

61.5 min 

40 

0.0363 

189.3 min 

123.7 min 


Table 4.3: Solver time of the parameter estimation problem on 12 and 32 cores, 
respectively, and using a different number of simulated trajectories. 

Note that the obvious approach of parallelization by computing the N in¬ 
dependent trajectories using separate threads by a sequential algorithm is un¬ 
favorable here, for two related reasons. Firstly, each executable needs to store 
a rather large state space in working memory. Secondly, each simulation must 
also access the complete database of externally scheduled events. 

5 Conclusions 

Modeling and simulation are important in designing surveillance and control 
of livestock diseases and of major economic importance. However, various 
pathogens require different models to capture the disease dynamics and trans¬ 
mission routes. Moreover, an increasing amount of epidemiologically relevant 
data is becoming available. We have adressesed these challenges and present a 
flexible and efficient computational framework for modeling and simulation of 
disease spread on a national scale. The simulation involves two parts. Firstly, 
the algorithm evolves stochastic dynamics of the disease process. Secondly, a 
list processor incorporates database events such as entering, exit, or movement 
of individuals into the model state. The framework is highly flexible in that 
most conceivable epidemiological models are either directly expressible, or the 
framework may be straightforwardly extended to encompass also non-standard 
models. As a concrete example it would be relatively easy to include interven¬ 
tion strategies such as vaccination programs in order to simulate the impact on 
the global dynamics. 

We have explored different strategies to parallelize the simulator on multi¬ 
socket architectures. Firstly, we decomposed the spatial information and the list 
of deterministic events. We then observed that the decomposed problem can be 
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Figure 4.6: Goal function G{k) in the form of confidence intervals G{k) ± 
2VG{k), visualized for each parameter kj, j = {1,2,3}, when the other pa¬ 
rameters ki, i ^ j, are held at the target value k*. Vertical lines indicate the 
target k* and the obtained estimates The parameter ranges have been 

normalized for ease of comparison. 


simulated at a high parallel efficiency, which is limited only by the processing 
of cross-boundary events. We then created three parallel implementations of 
the simulator core; we used Open MP to only parallelize private computations in 
a fork-join fashion, while cross-boundary events were processed in serial. Two 
further implementations use a dependency-aware task scheduler to create ex¬ 
ecution traces that interleave cross-boundary events and private computations 
with respect to their dependencies. We find that this strategy allows us to 
exploit shared-memory parallelism at a higher degree than the fork-join ap¬ 
proach if task sizes are sufficiently large. We benchmark this approach using 
the SuperGlue task library, but present a set of scheduling rules defining the 
parallel simulator on general terms, thus allowing it to be implemented also 
with other dependency-aware task libraries. 

We benchmarked our simulator using a model of the spatio-temporal spread 
of VTEC 0157 bacteria in the Swedish cattle population. The model contains 
37221 nodes and evolves ~ 10® external events from register data. We found that 
at a low private work load, the fork-join approach performs best, mainly due to 
the scheduling overhead of the task-based approaches. For higher private work 
loads, the simulation benefits from task-based computing, doubling the parallel 
efficiency on 32 cores in comparison to the fork-join approach. 

To further inspect the performance dependency on network properties, we 
constructed a synthetic benchmark where cross-boundary events were generated 
randomly. Here we found that the performance of the fork-join approach and 
the coarse-grained task approach scales well with a growing amount of cross- 
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boundary events. Notably, the performance of the fine-grained task processing 
depends more strongly on the connectivity of boundary crossing events, thus 
favoring a more fragmented network. 

In a final example we used the simulator to carry out an experimental pa¬ 
rameter fitting within the VTEC 0157 bacteria spread model. We emphasize 
the high computational complexity of this task with multiple unknown param¬ 
eters to fit and the need to use several full simulation runs to evaluate each 
parameter candidate. A similar load case results when different intervention 
strategies are to be evaluated. For example, even when several interventions re¬ 
duce the infectious spread globally, a policy maker could be interested in finding 
the most cost-efficient strategy. With this work, we provide a powerful, highly 
general and freely available software, that can contribute to a rapid and more 
efficient development of realistic large-scale epidemiological models. 

Future research will encompass studies of larger inverse problems, includ¬ 
ing more realistic data input, and more complex dynamics. Yet another point 
for future study is the scalability of the task-based approach in a distributed 
environment. 
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A Algorithms 


Algorithm 1 Sequential simulation loop 

1: Initialize: Compute all stochastic rates Wi in all nodes i = 1,..., A„. 

2: while t < TEnd do 

3: for all nodes i=l to do 

4: while t < {tn + At) do 

5: Compute the sum A of all transition intensity functions. 

6: Sample the next stochastic event time by r = —log(rand)/\ using a 

uniformly distributed random variable rand. 

7: Determine which event happened. Sample the next event (by inver¬ 

sion); find n such that X)J=i) < A rand < 

8: Update the state using the stoichiometric matrix S. 

9: Update ojn using the dependency graph G to recalculate only affected 

stochastic rates. 

10: end while 

11: end for 

12 : tjiJ^i = tn Atji 

13: Incorporate externally defined events in lists £ 1 ^ 2 - 

14: Loop over all nodes X^*^ and update the continuous state variable 

15: end while 
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Algorithm 2 Parallel simulation loop 

1 : Initialize: Decompose the nodes V into k sub-domains 14. Re-arrange the 
external events of type Ei into private lists of each sub-domain k, where 
all n affected nodes X„ G 14. Further divide all i ?2 events into the private 
list £2 or the list of domain-crossing events ff- 
2: -while t < TEnd do 
3: for all i=l to k do 

4: % Parallel task 7s; 

5: Execute line 14 of Algorithm 1 for all nodes in sub-domain I 4 if n > 1. 

6: Execute lines 3-10 of Algorithm 1 for all nodes in sub-domain I 4 evolv¬ 

ing time t G [tn,tn + At]. 

7: Execute line 13 of Algorithm 1 for all events in lists and at time 

t G [tmtn + At]. 

8: % End of parallel task 7s 

9: end for 

10: % Parallel task Tm', 

11: Execute line 13 of Algorithm 1 for all events in the list at time t G 

[tn {tn + At)]. 

12: % End of parallel task Tm] 

13: tyj^i =: tn ^tn 

14: end while 
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